CFAES Bioinformatics Core, OSU
2026-02-05
The transcriptome is the full set of transcripts expressed by an organism, which:
Is not at all stable across time & space in any given organism
(unlike the genome but much like the proteome)
Varies both qualitatively (which transcripts are expressed) and, especially, quantitatively (how much of each transcript is expressed)
Transcriptomics is the study of the transcriptome,
i.e. the large-scale study of RNA transcripts expressed in an organism.
Many approaches & applications — but most commonly, transcriptomics focuses on:
https://hbctraining.github.io
Considering…
That protein production gives clues about the activity of specific biological functions, and the molecular mechanisms underlying those functions;
That it is much easier to measure transcript expression than protein expression at scale;
The central dogma
… we can use gene expression levels as a proxy for protein expression levels and make functional inferences.
Specifically, we can use transcriptomics to:
RNA-Seq is the current state-of-the-art family of methods to study the transcriptome.
It involves the random sequencing of millions of transcript fragments per sample.
We will focus on the most common type of RNA-Seq, which:
RNA-Seq data can also be used for applications other than expression quantification:
For organisms without a reference genome: identify genes present in the organism
For organisms with a reference genome: discover new genes & transcripts,
and improve genome annotation
All in all, RNA-Seq is a very widely used technique —
it constitutes the most common usage of high-throughput sequencing!
RNA-Seq is also the most common data type I assist with as an MCIC bioinformatician. Some projects I’ve worked on used it to identify genes & pathways that differ between:
Multiple soybean cultivars in response to Phytophthora sojae inoculation; soybean in response to different Phytophthora species and strains (Dorrance lab, PlantPath)
Wheat vs. Xanthomonas with a gene knock-out vs. knock-in (Jacobs lab, PlantPath)
Mated and unmated mosquitos (Sirot lab, College of Wooster)
Tissues of the ambrosia beetle and its symbiotic fungus (Ranger lab, USDA Wooster)
Diapause-inducing conditions for two pest stink bug species (Michel lab, Entomology)
Human carcinoma cell lines with vs. without a manipulated gene (Cruz lab, CCC)
Pig coronaviruses with vs. without an experimental insertion (Wang lab, CFAH)
And to improve the annotation of a nematode genome (Taylor lab, PlantPath)
RNA-Seq typically compares groups of samples defined by differences in:
Treatments (e.g. different host plant, temperature, diet, mated/unmated) and/or
Organismal variants: ages/developmental stages, sexes, or genotypes (lines/biotypes/subspecies/morphs) and/or
Tissues
https://github.com/ScienceParkStudyGroup/rnaseq-lesson
To be able to make statistically supported conclusions about expression differences between such groups of samples, we must have biological replication.
When designing an RNA-Seq experiment, keep the following in mind:
Technical replicates?
You won’t need technical replicates that only replicate library prep and/or sequencing, but depending on your experimental design, you may want to technically replicate something else.
This paper uses RNA-seq data to study gene expression in Culex pipiens mosquitos infected with malaria-causing Plasmodium protozoans — specifically, it compares mosquitos according to:
https://sydney-informatics-hub.github.io/training-RNAseq-slides
Fig. from Kukurba & Montgomery 2015
Guidelines are highly approximate (cf. genomics): the required depth depends not just on transcriptome size, but also on the expression level distribution, expression levels of genes of interest, etc.
Typical recommendations are 20-50 million reads per sample (more for e.g. transcript-level inferences)
For statistical power, more replicates are better than a higher sequencing depth:
Fig. from Liu et al. 2014
Modified after Kukurba & Montgomery 2015
You will typically receive a “demultiplexed” (split-by-sample) set of FASTQ files.
Once you receive your data, the first series of analysis steps involves going from the raw reads to a count table (which will have a read count for each gene in each sample).
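To make the idea of a count table concrete, here is a minimal sketch in Python. The sample names, gene names, and the `assigned_reads` structure are hypothetical toy data; real pipelines derive the table from alignments or quantification output rather than from a list like this.

```python
from collections import Counter

# Hypothetical toy input: for each sample, the gene each read was assigned to.
assigned_reads = {
    "sample_A": ["gene1", "gene1", "gene2", "gene3", "gene1"],
    "sample_B": ["gene2", "gene2", "gene3", "gene3", "gene3"],
}

def build_count_table(assigned_reads):
    """Return {gene: {sample: read_count}}, filling in zeros where needed."""
    per_sample = {s: Counter(genes) for s, genes in assigned_reads.items()}
    all_genes = sorted(set().union(*per_sample.values()))
    # A Counter returns 0 for genes that were never seen in a sample.
    return {g: {s: per_sample[s][g] for s in assigned_reads} for g in all_genes}

count_table = build_count_table(assigned_reads)
for gene, row in count_table.items():
    print(gene, row)
```

Each row is a gene and each column a sample, which is the shape that the downstream statistical analysis expects.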
This part is bioinformatics-heavy: it involves large files, needs lots of computing power (such as a supercomputer), and uses command-line (Unix shell) programs. Specifically, it involves:
Read preprocessing
Aligning reads to a reference genome (+ alignment QC)
Quantifying expression levels
This can be run using standardized, one-size-fits-all workflows, and is therefore (relatively) suitable to be outsourced to a company, facility, or collaborator.
Read pre-processing includes the following steps:
The alignment of reads to a reference genome needs to be “splice-aware”.
Alternatively, you can align to the transcriptome (i.e., all mature transcripts):
At heart, a simple counting exercise once you have the alignments in hand.
But made more complicated by sequencing biases and multi-mapping reads.
Current best-performing tools (e.g. Salmon) do transcript-level quantification — even though this is typically followed by gene-level aggregation prior to downstream analysis.
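The gene-level aggregation step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the transcript IDs, the `tx2gene` mapping, and the estimated counts are made up, and a simple sum per gene is assumed, which is not necessarily everything a dedicated tool does.

```python
from collections import defaultdict

# Hypothetical transcript-to-gene mapping and transcript-level estimates
# for one sample (estimated counts can be fractional).
tx2gene = {"tx1a": "geneA", "tx1b": "geneA", "tx2a": "geneB"}
tx_counts = {"tx1a": 120.5, "tx1b": 30.0, "tx2a": 75.25}

def aggregate_to_genes(tx_counts, tx2gene):
    """Sum transcript-level counts over all transcripts of each gene."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)

gene_counts = aggregate_to_genes(tx_counts, tx2gene)
print(gene_counts)
```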
Fast-moving field
Several very commonly used tools like featureCounts (>15k citations) and HTSeq (>18k citations) have become disfavored in the past couple of years because, for example, they don’t count multi-mapping reads at all.
The “nf-core” initiative (https://nf-co.re) attempts to produce best-practice and automated workflows/pipelines, like for RNA-Seq (https://nf-co.re/rnaseq):
The second part of RNA-Seq data analysis involves analyzing the count table.
In contrast to the first part, this can be done on a laptop and instead is heavier on statistics, data visualization and biological interpretation.
It is typically done with the R language, and common steps include:
Principal Component Analysis (PCA)
Assessing overall sample clustering patterns
Differential Expression (DE) analysis
Finding genes that differ in expression level between sample groups (DEGs)
Functional enrichment analysis
See whether certain gene function groupings are overrepresented among DEGs
A PCA analysis will help to visualize overall patterns of similarity among samples,
for example whether our groups of interest cluster:
Fig. 1 from Garrigos et al. 2023
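As a rough illustration of what PCA does with expression data, the sketch below computes sample scores on the first principal component only. It uses plain Python and power iteration, which is a swapped-in simplification rather than what R packages do. The toy counts, the sample names, and the log2(x + 1) transform are all assumptions made for this example.

```python
import math

# Hypothetical normalized counts: rows = samples, columns = genes (toy data).
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
counts = [
    [100, 50, 10, 200],
    [110, 45, 12, 190],
    [20, 300, 15, 210],
    [25, 280, 11, 205],
]

def first_pc_scores(matrix, n_iter=200):
    """Score each sample on PC1 using power iteration on X^T X."""
    # Log-transform so that a few highly expressed genes do not dominate.
    logged = [[math.log2(x + 1) for x in row] for row in matrix]
    n, p = len(logged), len(logged[0])
    # Center each gene (column) on its mean across samples.
    means = [sum(row[j] for row in logged) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in logged]
    v = [1.0 / math.sqrt(p)] * p  # starting direction in gene space
    for _ in range(n_iter):
        proj = [sum(r[j] * v[j] for j in range(p)) for r in centered]  # X v
        w = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(p)]  # X^T X v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every sample onto the converged direction.
    return [sum(r[j] * v[j] for j in range(p)) for r in centered]

scores = first_pc_scores(counts)
for name, score in zip(samples, scores):
    print(name, round(score, 2))
```

In this toy data the two control samples land on one side of PC1 and the two treated samples on the other, which is the kind of group clustering a PCA plot is meant to reveal.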
A Differential Expression (DE) analysis allows you to test, for every single expressed gene in your dataset, whether it significantly differs in expression level between groups.
Typically, this is done with pairwise comparisons between groups:
Gene count normalization
To be able to fairly compare samples, raw gene counts need to be adjusted:
R packages to the rescue
Specialized R/Bioconductor packages like DESeq2 and edgeR make differential expression analysis relatively straightforward and automatically take care of the above-mentioned considerations (we will use DESeq2 in the lab).
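To give a feel for one such adjustment, the sketch below illustrates library-size normalization with median-of-ratios size factors, the general idea behind DESeq2's default normalization. It is a simplified Python illustration on made-up counts, not DESeq2's actual code, and the gene names and values are hypothetical.

```python
import math
from statistics import median

# Hypothetical raw counts: gene -> counts in three samples (toy data).
# Sample 2 was sequenced twice as deeply as sample 1, sample 3 half as deeply.
raw_counts = {
    "geneA": [100, 200, 50],
    "geneB": [10, 20, 5],
    "geneC": [30, 60, 15],
    "geneD": [0, 4, 1],
}

def size_factors(counts):
    """Median-of-ratios size factor for each sample."""
    n_samples = len(next(iter(counts.values())))
    # Only genes with nonzero counts everywhere can be used (log of 0 is undefined).
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples (a pseudo-reference sample).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

factors = size_factors(raw_counts)
normalized = {g: [c / f for c, f in zip(row, factors)] for g, row in raw_counts.items()}
print([round(f, 3) for f in factors])
print({g: [round(x, 1) for x in row] for g, row in normalized.items()})
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so that differences in sequencing depth are not mistaken for expression differences.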
Lists of DEGs can be quite long, and it is not always easy to make biological sense of them. Functional enrichment analyses help with this.
Functional enrichment analyses check whether certain functional categories of genes are statistically overrepresented among up- and/or downregulated genes.
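One common way to test for such overrepresentation is a one-sided hypergeometric test, sketched below in plain Python. The function name and all the numbers are hypothetical, and real enrichment tools add further steps on top of this.

```python
from math import comb

def enrichment_p_value(n_universe, n_category, n_degs, n_overlap):
    """P(overlap >= n_overlap) when n_degs genes are drawn at random.

    n_universe: number of genes tested
    n_category: genes in the functional category
    n_degs:     differentially expressed genes
    n_overlap:  DEGs that belong to the category
    """
    max_k = min(n_category, n_degs)
    total = comb(n_universe, n_degs)
    # Sum the probabilities of observing k = n_overlap, ..., max_k category genes.
    tail = sum(
        comb(n_category, k) * comb(n_universe - n_category, n_degs - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total

# Hypothetical example: 1000 tested genes, 50 in the category, 100 DEGs,
# of which 15 are in the category (about 5 would be expected by chance).
p = enrichment_p_value(1000, 50, 100, 15)
print(f"{p:.2e}")
```

A small p-value suggests that the category contains more DEGs than chance alone would explain.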
There are a number of databases that group genes into functional categories, but the two main ones used for enrichment analysis are:
Fig. 4 from Garrigos et al. 2023
KEGG focuses on pathways for cellular and organismal functions whose genes can be drawn and connected in maps.
Rodriguez et al. 2020: “KEGG representation of up-regulated genes related to jasmonic acid (JA) signal transduction pathways (ko04075) in banana cv. Calcutta 4 after inoculation with Pseudocercospora fijiensis. Genes or chemicals up-regulated at any time point were highlighted in green.”